Life expectancy dataset

The dataset consists of data collected from WHO (health factors, life expectancy) and economic data from the United Nation website and was published on Kaggle (https://www.kaggle.com/kumarajarshi/life-expectancy-who). It has been observed that in the past 15 years, there has been a huge development in the health sector resulting in improvement of human mortality rates especially in the developing nations in comparison to the past 30 years. Data from 2000-2015 for 193 countries has been collected and was merged together.

Most of the missing data was for population, Hepatitis B and GDP. The missing data were from less known countries like Vanuatu, Tonga, Togo, Cabo Verde etc. Finding all data for these countries was difficult and hence, it was decided that we exclude these countries from the final model data-set. The final merged file(final dataset) consists of 22 Columns and 2938 rows which meant 20 predicting variables. All predicting variables was then divided into several broad categories: Immunization related factors, Mortality factors, Economical factors and Social factors.

Predictor explanations:
Status: Is a country already developed or is it still developing?
Adult mortality: Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
Infant deaths: Number of Infant Deaths per 1000 population
Alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
Percentage expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita(%)
Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
Measles: Measles - number of reported cases per 1000 population
BMI: Average Body Mass Index of entire population
Under five deaths: Number of under-five deaths per 1000 population
Polio: Polio (Pol3) immunization coverage among 1-year-olds (%)
Total expenditure: General government expenditure on health as a percentage of total government expenditure (%)
Diphtheria: Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
HIV/AIDS: Deaths per 1 000 live births HIV/AIDS (0-4 years)
GDP: Gross Domestic Product per capita (in USD)
Population: Population of the country
Thinness 1-19: Prevalence of thinness among children and adolescents for Age 10 to 19 (% )
Thinness 5-9: Prevalence of thinness among children for Age 5 to 9(%)
Income comp: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
Schooling: Number of years of Schooling(years)

For this assignment, the life expectancy will be predicted via a linear regression model and a regression model with random forests. This to find relaitonships within the life expectancy data set from the WHO.

Libraries

library(tidyverse)
library(randomForest)
library(hexbin)

library(GGally)
library(ggplot2)
library(plotly)
library(ISLR)
library(MASS)
library(dplyr)
library(jtools)

Data Exploration

life_expectancy <- read.csv("life-expectancy.csv")

dim(life_expectancy)
## [1] 2938   22
summary(life_expectancy)
##    Country               Year         Status          Life.expectancy
##  Length:2938        Min.   :2000   Length:2938        Min.   :36.30  
##  Class :character   1st Qu.:2004   Class :character   1st Qu.:63.10  
##  Mode  :character   Median :2008   Mode  :character   Median :72.10  
##                     Mean   :2008                      Mean   :69.22  
##                     3rd Qu.:2012                      3rd Qu.:75.70  
##                     Max.   :2015                      Max.   :89.00  
##                                                       NA's   :10     
##  Adult.Mortality infant.deaths       Alcohol        percentage.expenditure
##  Min.   :  1.0   Min.   :   0.0   Min.   : 0.0100   Min.   :    0.000     
##  1st Qu.: 74.0   1st Qu.:   0.0   1st Qu.: 0.8775   1st Qu.:    4.685     
##  Median :144.0   Median :   3.0   Median : 3.7550   Median :   64.913     
##  Mean   :164.8   Mean   :  30.3   Mean   : 4.6029   Mean   :  738.251     
##  3rd Qu.:228.0   3rd Qu.:  22.0   3rd Qu.: 7.7025   3rd Qu.:  441.534     
##  Max.   :723.0   Max.   :1800.0   Max.   :17.8700   Max.   :19479.912     
##  NA's   :10                       NA's   :194                             
##   Hepatitis.B       Measles              BMI        under.five.deaths
##  Min.   : 1.00   Min.   :     0.0   Min.   : 1.00   Min.   :   0.00  
##  1st Qu.:77.00   1st Qu.:     0.0   1st Qu.:19.30   1st Qu.:   0.00  
##  Median :92.00   Median :    17.0   Median :43.50   Median :   4.00  
##  Mean   :80.94   Mean   :  2419.6   Mean   :38.32   Mean   :  42.04  
##  3rd Qu.:97.00   3rd Qu.:   360.2   3rd Qu.:56.20   3rd Qu.:  28.00  
##  Max.   :99.00   Max.   :212183.0   Max.   :87.30   Max.   :2500.00  
##  NA's   :553                        NA's   :34                       
##      Polio       Total.expenditure   Diphtheria       HIV.AIDS     
##  Min.   : 3.00   Min.   : 0.370    Min.   : 2.00   Min.   : 0.100  
##  1st Qu.:78.00   1st Qu.: 4.260    1st Qu.:78.00   1st Qu.: 0.100  
##  Median :93.00   Median : 5.755    Median :93.00   Median : 0.100  
##  Mean   :82.55   Mean   : 5.938    Mean   :82.32   Mean   : 1.742  
##  3rd Qu.:97.00   3rd Qu.: 7.492    3rd Qu.:97.00   3rd Qu.: 0.800  
##  Max.   :99.00   Max.   :17.600    Max.   :99.00   Max.   :50.600  
##  NA's   :19      NA's   :226       NA's   :19                      
##       GDP              Population        thinness..1.19.years
##  Min.   :     1.68   Min.   :3.400e+01   Min.   : 0.10       
##  1st Qu.:   463.94   1st Qu.:1.958e+05   1st Qu.: 1.60       
##  Median :  1766.95   Median :1.387e+06   Median : 3.30       
##  Mean   :  7483.16   Mean   :1.275e+07   Mean   : 4.84       
##  3rd Qu.:  5910.81   3rd Qu.:7.420e+06   3rd Qu.: 7.20       
##  Max.   :119172.74   Max.   :1.294e+09   Max.   :27.70       
##  NA's   :448         NA's   :652         NA's   :34          
##  thinness.5.9.years Income.composition.of.resources   Schooling    
##  Min.   : 0.10      Min.   :0.0000                  Min.   : 0.00  
##  1st Qu.: 1.50      1st Qu.:0.4930                  1st Qu.:10.10  
##  Median : 3.30      Median :0.6770                  Median :12.30  
##  Mean   : 4.87      Mean   :0.6276                  Mean   :11.99  
##  3rd Qu.: 7.20      3rd Qu.:0.7790                  3rd Qu.:14.30  
##  Max.   :28.60      Max.   :0.9480                  Max.   :20.70  
##  NA's   :34         NA's   :167                     NA's   :163
head(life_expectancy)
colnames(life_expectancy)[which(names(life_expectancy) == "Income.composition.of.resources")] <- "Income.comp"
life_expectancy$Status <- as.factor(life_expectancy$Status)
life_expectancy$Country <- as.factor(life_expectancy$Country)
str(life_expectancy)
## 'data.frame':    2938 obs. of  22 variables:
##  $ Country               : Factor w/ 193 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year                  : int  2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
##  $ Status                : Factor w/ 2 levels "Developed","Developing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Life.expectancy       : num  65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
##  $ Adult.Mortality       : int  263 271 268 272 275 279 281 287 295 295 ...
##  $ infant.deaths         : int  62 64 66 69 71 74 77 80 82 84 ...
##  $ Alcohol               : num  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
##  $ percentage.expenditure: num  71.3 73.5 73.2 78.2 7.1 ...
##  $ Hepatitis.B           : int  65 62 64 67 68 66 63 64 63 64 ...
##  $ Measles               : int  1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
##  $ BMI                   : num  19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
##  $ under.five.deaths     : int  83 86 89 93 97 102 106 110 113 116 ...
##  $ Polio                 : int  6 58 62 67 68 66 63 64 63 58 ...
##  $ Total.expenditure     : num  8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
##  $ Diphtheria            : int  65 62 64 67 68 66 63 64 63 58 ...
##  $ HIV.AIDS              : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ GDP                   : num  584.3 612.7 631.7 670 63.5 ...
##  $ Population            : num  33736494 327582 31731688 3696958 2978599 ...
##  $ thinness..1.19.years  : num  17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
##  $ thinness.5.9.years    : num  17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
##  $ Income.comp           : num  0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
##  $ Schooling             : num  10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...

1 Investigating correlations

For this part, we will check whether there is a correlation between the different variables within the life expectancy dataset. First the variables will be converted to a numerical format.

Heatmapping

data_var <- life_expectancy %>% select_if(is.numeric)

ggcorr(data_var, label = T, label_size = 2.2, label_round = 2, hjust = 1, size = 3.5, color = "black", layout.exp = 5, low = "red2", mid = "orange", high = "springgreen", name = "Correlation")

Life.expectancy is the strongest correlated with Income.comp (0.72) and Schooling (0.75), but it is negatively correlated with Adult.Mortality. If the mortality rate of the adult is high, the life expectancy of the people within this dataset will be low. But the strongest correlations are infant.deaths with under.five.deaths (exactly 1), thinness..1.19 years and thinness.5.9.years (0.94) and between percentage.expenditure and Gross Domestic Product per capita (GDP) (0.9). But we will create a scatterplot to visualise the strongest correlation of Life.expectancy, which is the correlation between between Life.expectancy and Schooling.

ggplot(data_var,
       aes(x = Life.expectancy, 
           y = Schooling)) +
  geom_point(aes(colour = Life.expectancy), size = 1)
## Warning: Removed 170 rows containing missing values (geom_point).

Polio, Diphtheria and Hepatitis B coverage

The Global Vaccine Action Plan (GVAP) has publicized the National Immunization Coverage Scorecards Estimates for 2018, in which the GVAP mentions their global immunization goals (https://www.who.int/publications/m/item/national-immunization-coverage-scorecards-estimates-for-2018). One of the goals is that all countries are to reach 90% national immunization coverage for diseases such as hepatitis B, diphtheria, tetanus, pertussis and polio. The World Health Organization states that the immunization coverage is a good indicator of health system performance (https://www.who.int/data/gho/indicator-metadata-registry/imr-details/95). In order to test how good of an indicator the immunization coverage is, visualizations will be made on the coverage of polio, diphtheria and hepatitis B.

life_expectancy_diseases <- life_expectancy %>% # Divide the diseases into two groups with ifelse()
                                  mutate(Polio = ifelse(Polio < 90, "<90% covered", ">=90% covered"),
                                         Polio = as.factor(Polio),
                                         Diphtheria = ifelse(Diphtheria < 90, "<90% covered", ">=90% covered"),
                                         Diphtheria = as.factor(Diphtheria),
                                         Hepatitis.B = ifelse(Hepatitis.B < 90, "<90% covered", ">=90% covered"),
                                         Hepatitis.B = as.factor(Hepatitis.B))

life_expectancy_diseases <- life_expectancy_diseases %>%
                                                        drop_na()

Polio coverage

life_expectancy_diseases %>% 
                         group_by(Polio) %>% 
                         summarise(count = n(), .groups = 'drop') %>% 
                         mutate(percentage = paste0(round(count/sum(count)*100, 2), "%"))
polio_plot <-  ggplot(life_expectancy_diseases, aes(x=Polio, y = Life.expectancy, fill = Polio)) +
                geom_boxplot() +
                scale_fill_manual(values=c("red", "green2")) +
                labs(x = "Polio Coverage", y = "Life Expectancy (Age)") +
                theme(legend.position = "none")

ggplotly(polio_plot)

The polio coverage is at 57.55%, which means that there are unfortunately still many countries that do not have a polio coverage of 90% or above. The median for countries with a polio coverage of 90% of above is at the age of 73.5, while the median for countries with a polio coverage of 90% or below is at the age of 65.35. That is a gap of over 8 years.

Diphtheria coverage

life_expectancy_diseases %>% 
                         group_by(Diphtheria) %>% 
                         summarise(count = n(), .groups = 'drop') %>% 
                         mutate(percentage = paste0(round(count/sum(count)*100, 2), "%"))
diphtheria_plot <- ggplot(life_expectancy_diseases, aes(x=Diphtheria, y = Life.expectancy, fill = Diphtheria)) +
                geom_boxplot() +
                scale_fill_manual(values=c("red", "green2")) +
                labs(x = "Diphtheria Coverage", y = "Life Expectancy (Age)") +
                theme(legend.position = "none")

ggplotly(diphtheria_plot)

The diphtheria coverage is at 57.31%, similar to that of polio. The median for countries with a diphtheria coverage of 90% of above is at the age of 73.6, while the median for countries with a polio coverage of 90% or below is at the age of 65.45. The diphtheria coverage is very similar to that of polio, which might indicate that vaccinations against polio and diphtheria are given at the same time.

Hepatitis B coverage

life_expectancy_diseases %>% 
                         group_by(Hepatitis.B) %>% 
                         summarise(count = n(), .groups = 'drop') %>% 
                         mutate(percentage = paste0(round(count/sum(count)*100, 2), "%"))
hepatitis_plot <-  ggplot(life_expectancy_diseases, aes(x=Hepatitis.B, y = Life.expectancy, fill = Hepatitis.B)) +
                geom_boxplot() +
                scale_fill_manual(values=c("red", "green2")) +
                labs(x = "Hepatitis B Coverage", y = "Life Expectancy (Age)") +
                theme(legend.position = "none")

ggplotly(hepatitis_plot)

The amount of countries that have a Hepatitis B coverage of 90% or above seem to be making out more or less half of the observations. In the same trend as the polio and diphtheria coverage, the hepatitis B coverage shows a higher life expectancy in countries where the coverage is at 90% or above. As it turns out, the immunization coverage appears to be a good indicator of health system performance, as the life expectancy is higher in countries where the coverage of diseases is at 90% or above.

Impact of Schooling

Life expectancy is the highest for those who had been educated beyond high school. So we can conclude that most of the older people in this dataset did have a longer life expectancy in relation to the people with a lower life expectancy. There are some few outliers as you can see in the plot, but the vast majority had > 10 years of education. This could also be summarised within the linear model between Life.expectancy and Schooling (lm_life) and plotted in a graph:

lm_life <- lm(formula = Schooling ~ Life.expectancy, data = data_var)
summary(lm_life)
## 
## Call:
## lm(formula = Schooling ~ Life.expectancy, data = data_var)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.4111  -1.1027   0.0164   1.2696   6.1774 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6.643467   0.313559  -21.19   <2e-16 ***
## Life.expectancy  0.268828   0.004481   59.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.206 on 2766 degrees of freedom
##   (170 observations deleted due to missingness)
## Multiple R-squared:  0.5655, Adjusted R-squared:  0.5653 
## F-statistic:  3599 on 1 and 2766 DF,  p-value: < 2.2e-16

Linear Regression

effect_plot(lm_life, pred = Life.expectancy, interval = TRUE, partial.residuals = TRUE, colors = "magenta", point.color = "darkorchid4")

2 Random Forest modelling

Methodology

Random forest out of the box can’t handle missing values so we use the roughfix method which follows the following approach: “NAs are replaced with column medians …. This is used as a starting point for inputing missing values by random forest missfill= 1,2 does a fast replacement of the missing values, for the training set (if equal to 1) and a more careful replacement (if equal to 2). mfixrep= k with missfill=2 does a slower, but usually more effective, replacement using proximities with k iterations on the training set only. (Requires nprox >0).”

Creating a training and testing set

set.seed(1337)
smp_size <- floor(0.8 * nrow(life_expectancy))
train_ind <- sample(seq_len(nrow(life_expectancy)), size = smp_size)
df_train <- life_expectancy[train_ind,]
df_test <- life_expectancy[-train_ind,]
rf_model <- randomForest(Life.expectancy ~ ., 
                         data = df_train %>% dplyr::select(!c(Country)),
                         na.action = na.roughfix
                         )
rf_model
## 
## Call:
##  randomForest(formula = Life.expectancy ~ ., data = df_train %>%      dplyr::select(!c(Country)), na.action = na.roughfix) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 3.541
##                     % Var explained: 96.03

Plot 1

mse <- function(y_true, y_pred) {
  mean((y_true - y_pred)^2)
}

imp <- data.frame(importance(rf_model)) %>% arrange(desc(IncNodePurity)) 
ggplot() +
  geom_bar(aes(x=reorder(rownames(imp), -imp$IncNodePurity),y=imp$IncNodePurity), stat="identity") +
  labs(y = "Feature Importance", x = "Feature") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

The random forest model has shown us that the schooling variable is not as important as some other variables. For example, the barplot shows that the number of HIV deaths plays a huge role as a predictor for life expectancy. The countries where HIV leads to huge numbers of deaths are generally underdeveloped and have worse living conditions, as well as schooling; which are possible contributors to the high HIV death tolls. The human development index serves as a statistic composition of life expectancy, education and per capita income. Therefore, it also makes sense that this is an important variable.

Plot 2

pred <- predict(rf_model, newdata = df_test)
ggplot(df_test) +
  geom_hex(aes(x=pred, y = Life.expectancy), bins = 60) +  
  scale_fill_continuous(type = "viridis") +
  theme_bw() +
  geom_abline()
## Warning: Removed 268 rows containing non-finite values (stat_binhex).

The plot shows how well the predictions on the test set fit to the actual values. We see that most of the values are very close to the diagonal line, indicating a good fit of the model. The random tree model has not been able to predict the countries that have actually have a very high life expectancy.

3 Best regression model

pred_plot <- function(model) {
  # Created a test set for lstat in order to make predictions
  x_pred <- df_test$Schooling
  y_pred <- predict(model, newdata = tibble(Schooling = x_pred))
  
  # Plot with the real data and a prediction line
df_train %>%
    ggplot(aes(x = Schooling , y = Life.expectancy)) +
    geom_point(aes(color=Status)) +
    geom_line(data = tibble(Schooling = x_pred, Life.expectancy = y_pred), size = 1, col = "red") +
    theme_minimal()
}
pred_plot2 <- function(model) {
  # Created a test set for lstat in order to make predictions
  x_pred <- df_test$HIV.AIDS
  y_pred <- predict(model, newdata = tibble(HIV.AIDS = x_pred))
  
  # Plot with the real data and a prediction line
df_train %>%
    ggplot(aes(x = HIV.AIDS , y = Life.expectancy)) +
    geom_point(aes(color=Status)) +
    geom_line(data = tibble(HIV.AIDS = x_pred, Life.expectancy = y_pred), size = 1, col = "red") +
    theme_minimal()
}
#Linear regression life expectancy
lr_life <- lm(Life.expectancy ~ Schooling, data = df_train)
#pred_plot(lr_life)
#Polynomial 3rd degree
#pn3_life <- lm(Life.expectancy ~ poly(Schooling, 3), data = df_train)
pn3_life <- lm(Life.expectancy ~ Schooling + I(Schooling^2) + I(Schooling^3), data = df_train)
#pred_plot(pn3_life)
# Piecewise constant
# create new dataset without missing data
newdata <- na.omit(df_train) 
sections <- c(-Inf, quantile(newdata$Schooling, probs = c(.2, .4, .6, .8)), Inf)
pwc_life <- lm(Life.expectancy ~ cut(Schooling, sections), data = newdata)
#pred_plot(pwc_life)
# Piecewise polynomial

piecewise_cubic_basis <- function(vec, knots = 1) {
  # If there is only one section, just return the 3rd order polynomial
  if (knots == 0) return(poly(vec, degree = 3, raw = TRUE))
  
  # cut the vector
  cut_vec <- cut(vec, breaks = knots + 1)
  
  # initialise a matrix for the piecewise polynomial
  out <- matrix(nrow = length(vec), ncol = 0)
  
  # loop over the levels of the cut vector
  for (lvl in levels(cut_vec)) {
    
    # temporary vector
    tmp <- vec
    
    # set all values to 0 except the current section
    tmp[cut_vec != lvl] <- 0
    
    # add the polynomial based on this vector to the matrix
    out <- cbind(out, poly(tmp, degree = 3, raw = TRUE))
    
  }
  
  out
}
pwp_life <- lm(Life.expectancy ~ piecewise_cubic_basis(Schooling, 3), data = df_train)
# Cubic spline
library(splines)
bs_life <- lm(Life.expectancy ~ bs(HIV.AIDS, knots = median(Schooling)), data = df_train)
#pred_plot2(bs_life)
# Natural spline

ns3_life <- lm(Life.expectancy ~ ns(HIV.AIDS, df = 3), data = df_train)
#pred_plot2(ns3_life)
library(cowplot)
plot_grid(
  pred_plot(lr_life) + ggtitle("Linear regression"),
  pred_plot(pn3_life) + ggtitle("Polynomial"),
  pred_plot(pwc_life) + ggtitle("Piecewise constant"),
  pred_plot(pwp_life) + ggtitle("Piecewise polynomial")

)
## Warning: Removed 138 rows containing missing values (geom_point).
## Warning: Removed 30 row(s) containing missing values (geom_path).
## Warning: Removed 138 rows containing missing values (geom_point).
## Warning: Removed 30 row(s) containing missing values (geom_path).
## Warning: Removed 138 rows containing missing values (geom_point).
## Warning: Removed 30 row(s) containing missing values (geom_path).
## Warning: Removed 138 rows containing missing values (geom_point).
## Warning: Removed 30 row(s) containing missing values (geom_path).

With this grid graph we can see that overall developed countries have more years of schooling and thus, more life expectancy.

plot_grid(
  pred_plot2(bs_life) + ggtitle("Cubic spline"),
  pred_plot2(ns3_life) + ggtitle("Natural spline")
)
## Warning in predict.lm(model, newdata = tibble(HIV.AIDS = x_pred)): prediction
## from a rank-deficient fit may be misleading
## Warning: Removed 8 rows containing missing values (geom_point).

## Warning: Removed 8 rows containing missing values (geom_point).

On the contrary, populations with high rate of HIV are not even registered and the prediction becomes unclear but we can observe that the points indicate low life expectancy in that case. The best life prediction is seen at 0% HIV, where developed countries are placed.